Attention (machine learning)

In artificial neural networks, attention is a technique that is meant to mimic cognitive attention. The effect enhances some parts of the input data while diminishing other parts, the motivation being that the network should devote more focus to the small, but important, parts of the data. Learning which part of the data is more important than another depends on the context, and this is trained by gradient descent. Attention-like mechanisms were introduced in the 1990s under names like multiplicative modules, sigma pi units, and hyper-networks. Its flexibility comes from its role as "soft weights" that can change during runtime, in contrast to standard weights that must remain fixed at runtime. Uses of attention include memory in neural Turing machines, reasoning tasks in differentiable neural computers, language processing in transformers and LSTMs, and multi-sensory data processing (sound, images, video, and text) in perceivers. There are several types of attention, including (a) Bahdanau attention, also referred to as additive attention, (b) Luong attention, known as multiplicative attention and built on top of additive attention, and (c) self-attention, introduced in transformers.


General idea

Given a sequence of tokens t_i labeled by the index i, a neural network computes a soft weight w_i for each t_i with the property that w_i is nonnegative and \sum_i w_i = 1. Each t_i is assigned a value vector v_i which is computed from the word embedding of the ith token. The weighted average \sum_i w_i v_i is the output of the attention mechanism. The query-key mechanism computes the soft weights. From the word embedding of each token, it computes the token's corresponding query vector q_i and key vector k_i. The weights are obtained by taking the softmax function of the dot products q_i k_j, where i indexes the current token and j indexes the token being attended to. In some architectures, there are multiple heads of attention, each operating independently with its own queries, keys, and values.
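
The following is a minimal sketch of the query-key-value mechanism just described, written with NumPy. The matrix names (W_q, W_k, W_v), dimensions, and random embeddings are illustrative assumptions, not a specific published model; it only shows how soft weights are formed from query-key dot products and applied to the value vectors.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row maximum before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(embeddings, W_q, W_k, W_v):
    """Single-head attention over a sequence of token embeddings.

    embeddings: (n_tokens, d_model) word embeddings of the tokens t_i
    W_q, W_k, W_v: projection matrices producing q_i, k_i, v_i
    Returns one output vector per token: sum_j w_ij v_j with sum_j w_ij = 1.
    """
    Q = embeddings @ W_q                 # query vectors q_i
    K = embeddings @ W_k                 # key vectors k_j
    V = embeddings @ W_v                 # value vectors v_j
    scores = Q @ K.T                     # dot products q_i k_j
    weights = softmax(scores, axis=-1)   # soft weights: nonnegative, each row sums to 1
    return weights @ V                   # weighted averages sum_j w_ij v_j

# Toy usage with random embeddings for a 4-token sequence (dimensions are arbitrary).
rng = np.random.default_rng(0)
d_model, d_head, n_tokens = 8, 4, 4
emb = rng.normal(size=(n_tokens, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
out = attention(emb, W_q, W_k, W_v)
print(out.shape)  # (4, 4): one output vector per token
```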


A language translation example

To build a machine that translates English to French, one grafts an attention unit onto the basic encoder-decoder. In the simplest case, the attention unit consists of dot products of the recurrent encoder states and does not need training. In practice, the attention unit consists of three fully connected neural network layers, called query, key, and value, that need to be trained. See the Variants section below. Viewed as a matrix, the attention weights show how the network adjusts its focus according to context. This view of the attention weights addresses the "explainability" problem that neural networks are criticized for. Networks that perform verbatim translation without regard to word order would have a diagonally dominant matrix if they were analyzable in these terms; off-diagonal dominance shows that the attention mechanism is more nuanced. On the first pass through the decoder, 94% of the attention weight is on the first English word "I", so the network offers the word "je". On the second pass of the decoder, 88% of the attention weight is on the third English word "you", so it offers "t'". On the last pass, 95% of the attention weight is on the second English word "love", so it offers "aime".
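
Below is a minimal sketch of the "simplest case" attention unit mentioned above: the decoder's current hidden state is dotted with each recurrent encoder state, and a softmax over those scores gives the attention weights and a context vector. The variable names, dimensions, and random values are illustrative assumptions; the encoder and decoder RNNs themselves are omitted.

```python
import numpy as np

def dot_product_attention(encoder_states, decoder_state):
    """encoder_states: (src_len, d) recurrent encoder states, one per source word
    decoder_state:  (d,) the decoder's hidden state for the current pass
    Returns (weights, context): attention weights over the source words and
    their weighted average (the context vector fed to the decoder)."""
    scores = encoder_states @ decoder_state          # one dot product per source word
    scores = scores - scores.max()                   # stabilize the softmax
    weights = np.exp(scores) / np.exp(scores).sum()  # e.g. mostly on "I" in the first pass
    context = weights @ encoder_states               # weighted average of encoder states
    return weights, context

# Toy usage for a 3-word source sentence "I love you".
rng = np.random.default_rng(1)
H = rng.normal(size=(3, 8))   # encoder states for "I", "love", "you"
s = rng.normal(size=8)        # decoder state on the first pass
w, c = dot_product_attention(H, s)
print(w)                      # attention weights over the three source words
```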


Variants

There are many variants of attention: dot product, query-key-value, hard, soft, self, cross, Luong, and Bahdanau, to name a few. These variants recombine the encoder-side inputs to redistribute those effects to each target output. Often, a correlation-style matrix of dot products provides the re-weighting coefficients.
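
As a rough sketch of two of the scoring functions named above: Luong-style (multiplicative) scores use a dot product, optionally through a learned matrix, while Bahdanau-style (additive) scores pass the query and each key through a small feed-forward layer. The matrix names and shapes here are illustrative assumptions, not a specific published implementation.

```python
import numpy as np

def luong_score(query, keys, W=None):
    # Multiplicative: q . k_j, or the "general" form q . W . k_j.
    if W is None:
        return keys @ query
    return keys @ (W @ query)

def bahdanau_score(query, keys, W_q, W_k, v):
    # Additive: v . tanh(W_q q + W_k k_j) for each key k_j.
    return np.tanh(keys @ W_k.T + query @ W_q.T) @ v

# Toy usage: one query vector scored against five key vectors.
rng = np.random.default_rng(2)
d, n = 6, 5
q = rng.normal(size=d)          # decoder/query vector
K = rng.normal(size=(n, d))     # encoder/key vectors
print(luong_score(q, K).shape)  # (5,): one score per key
W_q, W_k = rng.normal(size=(d, d)), rng.normal(size=(d, d))
v = rng.normal(size=d)
print(bahdanau_score(q, K, W_q, W_k, v).shape)  # (5,): one score per key
```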


See also

* for query-key-value (QKV) attention


References


External links

* Dan Jurafsky and James H. Martin (2022), ''Speech and Language Processing'' (3rd ed. draft, January 2022), ch. 10.4 Attention and ch. 9.7 Self-Attention Networks: Transformers
* Alex Graves (4 May 2020), Attention and Memory in Deep Learning (video lecture), DeepMind / UCL, via YouTube
* Rasa Algorithm Whiteboard - Attention, via YouTube